Abstract

The class of Schoenberg transformations, embedding Euclidean distances
into higher dimensional Euclidean spaces, is presented, and derived from
theorems on positive definite and conditionally negative definite
matrices. Original results on the arc lengths, angles and curvature of
the transformations are proposed, and visualized on artificial data sets
by classical multidimensional scaling. A simple distance-based
discriminant algorithm illustrates the theory, intimately connected to
the Gaussian kernels of Machine Learning.

1 Introduction



Schoenberg transformations are elementwise mappings of Euclidean dis- 
tances into new Euclidean distances, embeddable in a higher dimensional 
space. Their potential in Data Analysis seems evident in view of the om- 
nipresence of Euclidean dissimilarities in Multidimensional Scaling (MDS), 
Factor Analysis, Correspondence Analysis or Clustering. Yet, despite its re- 
spectable age (Schoenberg 1938a), the properties and the very existence of 
this class of transformations appear to be little known in the Data Analytic 
community. 

Non-linear embeddings of original data into higher dimensional feature
spaces are familiar in the Machine Learning community, which however bases
its formalism upon kernels, that is positive definite (p.d.) matrices, rather
than on squared Euclidean distances, which are conditionally negative
definite (c.n.d.) matrices with a null diagonal.

Some aspects of the correspondence between p.d. and c.n.d. matrices
are well-known in Data Analysis, and lie at the core of classical MDS
(Theorems 1 and 2). Other aspects (Theorem 3), central to the derivation of
Schoenberg transformations (Definition 2), are less well known. Section 2 is
a self-contained review of all those results, scattered in the literature,
together with their proofs. Section 3 analyses some of the general properties
of Schoenberg transformations, and yields original results about angles, arc
lengths and curvatures. Section 4 illustrates the non-linear and spectral
properties of the transformations on two artificial data sets - the grid and
the rod. An elementary yet efficient distance-based linear discriminant
algorithm is presented in Section 5. Section 6 proposes in conclusion to
revisit the Machine Learning formalism in terms of Euclidean distances,
rather than in terms of kernels.

2 Definitions and Theorems 
2.1 Preliminaries 

Classical multidimensional scaling (MDS) (e.g. Borg and Groenen 1997) can
be performed iff the eigenvalues of the so-called matrix of scalar products are
non-negative. For the sake of concision, we shall refer to such a matrix as
positive definite (instead of "positive semi-definite"), while a strictly
positive definite matrix will be characterized by strictly positive eigenvalues.

Vectors are meant as column vectors. $I$ denotes the identity matrix,
and $\mathbf{1}$ the unit vector, all components of which are unity. Depending
upon context, the "prime" either denotes the transpose of a matrix, or the
derivative of a scalar function.

Definition 1 A real symmetric $n \times n$ matrix $C = (c_{ij})$ is said to be

• positive definite (p.d.) if $(z, Cz) = \sum_{ij} c_{ij} z_i z_j \ge 0$ for all vectors $z \in \mathbb{R}^n$

• conditionally negative definite (c.n.d.) if $(z, Cz) = \sum_{ij} c_{ij} z_i z_j \le 0$ for
all $z \in \mathbb{R}^n$ such that $\sum_{i=1}^n z_i = 0$.

Consider a signed distribution $a$ on $n$ objects, that is a vector obeying
$\sum_{i=1}^n a_i = 1$, where some components are possibly negative. Consider also
the $n \times n$ centering matrix $H(a) = I - \mathbf{1}a'$, with components
$\delta_{ij} - a_j$. Let $C$ be a symmetric $n \times n$ matrix, and define the matrix

$$B(a) = -\frac{1}{2}\, H(a)\, C\, H'(a). \tag{1}$$

Theorem 1 (Young and Householder 1938; Schoenberg 1938b)

For any signed distribution $a$,

$$B(a) \text{ is p.d.} \iff C \text{ is c.n.d.}$$

Proof: first observe that if $B(a)$ is p.d., then $B(\tilde a)$ is also p.d. for any
other signed distribution $\tilde a$, in view of the identity
$B(\tilde a) = H(\tilde a) B(a) H'(\tilde a)$, itself a consequence of
$H(\tilde a) = H(\tilde a) H(a)$. Also, for any $z$, $(z, B(a)z) = -\frac12 (y, Cy)$
where the vector $y = H'(a)z$ obeys $\sum_i y_i = 0$ for any $z$, showing "$\Leftarrow$".
Also, $y = H'(a)y$ whenever $\sum_i y_i = 0$, and hence $(y, B(a)y) = -\frac12 (y, Cy)$,
thus demonstrating "$\Rightarrow$". □

Theorem 2 (classical MDS) Let $C = (c_{ij})$ be a symmetric $n \times n$ matrix.
Define the associated zero-diagonal matrix $\tilde C = (\tilde c_{ij})$ as
$\tilde c_{ij} = c_{ij} - \frac12 c_{ii} - \frac12 c_{jj}$. Then

$$B(a) = -\frac{1}{2}\, H(a)\, \tilde C\, H'(a) \quad\text{and}\quad \tilde c_{ij} = b_{ii}(a) + b_{jj}(a) - 2 b_{ij}(a). \tag{2}$$

Moreover, $C$ is c.n.d. iff $\tilde C$ is c.n.d. In this case, the components
$\tilde c_{ij}$ are "isometrically embeddable in $\ell^2$", that is representable as
squared Euclidean distances $D_{ij}$ between $n$ objects as

$$\tilde c_{ij} = D_{ij} = \sum_{\alpha=1}^{p} (x_{i\alpha} - x_{j\alpha})^2, \qquad i, j = 1, \dots, n \tag{3}$$

where the object coordinates can be chosen as

$$x_{i\alpha} = \sqrt{\lambda_\alpha(a)}\; u_{i\alpha}(a) \tag{4}$$

where the $\lambda_\alpha(a)$ are the diagonal components of the diagonal matrix
$\Lambda(a)$ and $u_{i\alpha}(a)$ are the components of the orthogonal matrix $U(a)$
occurring in the spectral decomposition $B(a) = U(a)\Lambda(a)U'(a)$.

Proof: the first identity in (2) follows from $H(a)\mathbf{1} = 0$, and the second one
from $b_{ii}(a) + b_{jj}(a) - 2b_{ij}(a) = c_{ij} - \frac12 c_{ii} - \frac12 c_{jj}$, itself a
consequence of the form (1) $b_{ij}(a) = -\frac12 c_{ij} + \gamma_i + \gamma_j$ for some
vector $\gamma$. The next assertion follows from $(y, \tilde C y) = (y, Cy)$ whenever
$\sum_i y_i = 0$, and identity (3) can be shown to amount to the second identity
(2) by direct substitution. □

The p.d. nature of $B(a)$ (Theorem 1) is crucial to ensure the non-negativity
of the eigenvalues $\lambda_\alpha$. Identity $H'(a)a = 0$ yields $B(a)a = 0$.
Hence, at least one eigenvalue is zero and $p \le n - 1$ in (3).
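
The construction of Theorems 1 and 2 can be sketched numerically as follows. This is a minimal illustration assuming NumPy; the function names `squared_distances` and `classical_mds` and the tolerance `tol` are hypothetical choices made for this sketch, not part of the original text.

```python
import numpy as np

def squared_distances(X):
    """Matrix of squared Euclidean distances between the rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return (diff ** 2).sum(axis=-1)

def classical_mds(D, a=None, tol=1e-9):
    """Coordinates reproducing the squared distances D (Theorem 2).

    a is a signed distribution summing to one (uniform by default).
    Returns the coordinates X and all eigenvalues of B(a), in decreasing order.
    """
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    a = np.full(n, 1.0 / n) if a is None else np.asarray(a, dtype=float)
    H = np.eye(n) - np.outer(np.ones(n), a)      # H(a) = I - 1 a'
    B = -0.5 * H @ D @ H.T                       # equation (1)
    lam, U = np.linalg.eigh((B + B.T) / 2.0)     # symmetrize against rounding
    order = np.argsort(lam)[::-1]
    lam, U = lam[order], U[:, order]
    scale = max(lam[0], 1.0)
    if lam[-1] < -tol * scale:
        raise ValueError("B(a) has a negative eigenvalue: D is not squared Euclidean")
    keep = lam > tol * scale                     # discard numerically null dimensions
    X = U[:, keep] * np.sqrt(lam[keep])          # equation (4)
    return X, lam

# Example: 10 points in three dimensions are recovered up to an isometry.
rng = np.random.default_rng(0)
P = rng.normal(size=(10, 3))
D = squared_distances(P)
X, lam = classical_mds(D)
print(X.shape, np.allclose(squared_distances(X), D))
```

The reconstructed distances do not depend on the choice of the signed distribution `a`, which only fixes the origin of the coordinates.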

Theorems 1 and 2 show that any p.d. matrix $B$, or equivalently any c.n.d.
matrix $C$, defines a unique set of squared Euclidean distances $D$ between
objects (Torgerson 1958; Gower 1966). The latter can be shown (e.g. from
(3)) to obey the celebrated Huygens principle, namely

$$\sum_{j=1}^{n} a_j D_{ij} = D_{ia} + \Delta_a, \qquad \Delta_a = \frac{1}{2} \sum_{i,j=1}^{n} a_i a_j D_{ij} \tag{5}$$

where $D_{ia}$ denotes the squared distance between object $i$ (with coordinates
$x_i$) and the $a$-barycenter defined by the coordinates $\bar x_a = \sum_j a_j x_j$. Also,
$\Delta_a \ge 0$ interprets as the average dispersion of the cloud, provided $a$ is a
non-negative distribution representing the relative weights of the objects.
In the general case of a signed distribution, $\Delta_a$ is still well defined, but can
be negative.

The squared Euclidean distance between the barycenters $\bar x_a$ and $\bar x_b$
associated to two signed distributions $a$ and $b$ can also be shown to satisfy

$$D_{ab} = -\frac{1}{2} \sum_{ij} (a_i - b_i)(a_j - b_j) D_{ij} \tag{6}$$

which directly demonstrates the c.n.d. nature of $D$ (since $z_i = a_i - b_i$ obeys
$\sum_i z_i = 0$). Also, (6) entails (5) with the choice $b_j = \delta_{jk}$ for some $k$.
Substituting (5) in (1) yields

$$b_{ij}(a) = -\frac{1}{2} (D_{ij} - D_{ia} - D_{ja})$$

which, by the cosine theorem, is the matrix of the scalar products between
$x_i$ and $x_j$ as measured from the origin $\bar x_a$. Low-dimensional factorial
reconstructions (that is limiting the sum in (3) to the largest eigenvalues)
express a maximum amount of $\mathrm{tr}(B(a)) = \sum_i D_{ia}$. This quantity, without
direct interpretation, is proportional to the uniform dispersion of the cloud
of coordinates with respect to the point $\bar x_a$. The dispersion $\mathrm{tr}(B(a))$ is
minimum when $a$ is the uniform distribution, a standard choice in classical
MDS (see e.g. Mardia et al. 1979).

Concentrating the mass of $a$ on a single existing object, typically the last
one, is often proposed for computational convenience. Other prescriptions
consider $a_i$ as proportional to the precision of measurement of object $i$ (see
e.g. Borg and Groenen 1997), or set $a_i = 0$ for objects whose behavior
might influence excessively the overall configuration, as in the treatment
of "supplementary elements" in Correspondence Analysis (see e.g. Benzécri
1992; Lebart, Morineau and Piron 1998; Meulman, van der Kooij and Heiser
2004; Greenacre and Blasius 2006). Other choices such as the circumcenter
or the incenter are discussed in Gower (1982). Note that the signed nature
of $a$ allows one to define an external origin $\bar x_a$ lying outside the convex
hull of the $n$ points, resulting in $b_{ij}(a) > 0$ for all pairs.

As a matter of fact, the choice of the origin $a$ and the choice of the
object weights $f$ constitute two distinct operations, as made explicit by
the following generalization of classical MDS (Cuadras and Fortiana 1996;
Bavaud 2006, 2009):

Theorem 3 (weighted MDS) Consider $n$ weighted objects with positive
weights $f_i > 0$ normalized to $\sum_i f_i = 1$, together with a (symmetric,
non-negative, zero-diagonal) pairwise dissimilarity matrix $D = (D_{ij})$. Let
$\Pi = (\pi_{ij}) = \mathrm{diag}(f)$, i.e. $\pi_{ij} = f_i \delta_{ij}$. Then $D$ is squared Euclidean iff
the matrix of weighted scalar products

$$K(a) = -\frac{1}{2} \sqrt{\Pi}\, H(a)\, D\, H'(a)\, \sqrt{\Pi}, \quad\text{that is}\quad K_{ij}(a) = \sqrt{f_i f_j}\; b_{ij}(a)$$

is p.d. The object coordinates can be chosen as

$$x_{i\alpha} = \sqrt{\frac{\lambda_\alpha(a)}{f_i}}\; u_{i\alpha}(a) \quad\text{with}\quad D_{ij} = \sum_{\alpha=1}^{p} (x_{i\alpha} - x_{j\alpha})^2 \tag{7}$$

where the eigenvalues $\lambda_\alpha(a)$ and eigenvectors $u_{i\alpha}(a)$ obtain from the
spectral decomposition $K(a) = U(a)\Lambda(a)U'(a)$. Moreover, the corresponding
low-dimensional factorial reconstruction, retaining in (7) only the components
$\alpha$ associated with the largest eigenvalues, expresses a maximum proportion of
the total inertia relative to $a$, namely

$$\mathrm{tr}(K(a)) = \sum_\alpha \lambda_\alpha = \sum_i f_i D_{ia} = \Delta_f + D_{fa}. \tag{8}$$

The proof follows from the definitions and Theorem 2 by direct substitution.
The last identity is a consequence of (5), and shows in particular the total
inertia to be minimum for $a = f$, as expected. When $f$ is uniform, the
eigenvalues in Theorems 2 and 3 coincide up to a factor $n$.

2.2 The class of Schoenberg transformations 

If $A = (a_{ij})$ and $B = (b_{ij})$ are p.d. matrices of same order, so are $cA$ for
$c > 0$, $(t_i a_{ij} t_j)$ for any vector $t$ (cf. Theorem 3), $A + B$, $AB$ (provided $A$
and $B$ commute, so that $AB$ is symmetric) as well as the element-wise product
or Hadamard product $A \circ B$ with components $a_{ij} b_{ij}$.
The latter result (Schur theorem) can be first proved for rank-one p.d.
matrices, and then extended to arbitrary ranks by matrix addition (see e.g.
Horn and Johnson 1991; Bhatia 2006). Combining those facts, one obtains
that the Hadamard integral power $A^{\circ p}$ with components $a_{ij}^p$ (where
$p \in \mathbb{N}$) and the Hadamard exponential $\exp(\circ A)$ with components
$\exp(a_{ij})$ are p.d. However, $A^{\circ\lambda}$ is generally not p.d. for $\lambda > 0$, unless
$\lambda \ge n - 2$ (Fitzgerald and Horn 1977). P.d. matrices $A$ such that $A^{\circ\lambda}$ is
p.d. for each $\lambda > 0$ are called infinitely divisible.

P.d. matrices are referred to as kernels in the Machine Learning community
(see e.g. Haussler 1999; Cristianini and Shawe-Taylor 2003; Hofmann,
Schölkopf and Smola 2008; and references therein). One of the most popular
kernels is the so-called radial basis function or Gaussian kernel $\exp(-\lambda D_{ij})$.

Theorem 4 (Infinitely divisible kernels) Let $C = (c_{ij})$ be a symmetric
matrix, and define $B = \exp(\circ\, {-C})$, that is $b_{ij} = \exp(-c_{ij})$. Then

$$B \text{ is infinitely divisible} \iff C \text{ is c.n.d.}$$

Proof (Horn and Johnson 1991 p. 456): consider the matrix
$a_{ij}(\lambda) = (1 - b_{ij}^\lambda)/\lambda$. If $B$ is infinitely divisible, then
$(z, A(\lambda)z) \le 0$ for any vector $z$ summing to zero, that is $A(\lambda)$ is c.n.d.
for any $\lambda > 0$. Hence $\lim_{\lambda \to 0^+} a_{ij}(\lambda) = -\ln b_{ij} = c_{ij}$ is c.n.d.,
showing "$\Rightarrow$". Conversely, suppose $C$ is c.n.d., and define
$F = -H(a)CH'(a)$ where $H(a)$ is the centering matrix of Section 2.1. By
Theorem 1, $F$ is p.d., and so is $\exp(\circ F)$. But
$\exp(f_{ij}) = \exp(-c_{ij} - \eta_i - \eta_j)$ since $f_{ij} = -c_{ij} - \eta_i - \eta_j$ for some
$\eta$. Hence $b_{ij} = \exp(-c_{ij}) = \exp(\eta_i) \exp(f_{ij}) \exp(\eta_j)$ is of the form
$t_i a_{ij} t_j$ with $A$ p.d., and hence p.d. By the same reasoning,
$b_{ij}^\lambda = \exp(-\lambda c_{ij})$ is p.d. for any $\lambda > 0$, since $\lambda C$ is c.n.d. iff $C$ is
c.n.d., thus proving "$\Leftarrow$". □

Corollary 1 (Gaussian kernel) Let $D_{ij}$ be a squared Euclidean distance.
Then, for any $\lambda > 0$, $\exp(-\lambda D_{ij})$ is p.d., and
$\tilde D_{ij}(\lambda) = 1 - \exp(-\lambda D_{ij})$ is a squared Euclidean distance.

Proof: the first assertion follows from Theorem 4 and the second from
Theorem 2 together with the fact that $\tilde D_{ij}(\lambda)$ can easily be shown to be
c.n.d. with a zero diagonal. □
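
Corollary 1 can be probed numerically with the quadratic forms of Definition 1. The sketch below uses only the Python standard library; the helper names, the random configuration and the value `lam = 0.7` are arbitrary assumptions made for illustration. Random test vectors provide a sanity check only, not a proof.

```python
import math
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((u - v) ** 2 for u, v in zip(p, q))

def quad_form(M, z):
    """Quadratic form (z, Mz) = sum_ij m_ij z_i z_j."""
    n = len(z)
    return sum(z[i] * M[i][j] * z[j] for i in range(n) for j in range(n))

def zero_sum_vector(n, rng):
    """Random vector whose components sum to zero."""
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    mean = sum(z) / n
    return [v - mean for v in z]

rng = random.Random(1)
points = [[rng.gauss(0.0, 1.0) for _ in range(2)] for _ in range(8)]
D = [[sq_dist(p, q) for q in points] for p in points]

lam = 0.7
K = [[math.exp(-lam * d) for d in row] for row in D]   # Gaussian kernel
D_tilde = [[1.0 - k for k in row] for row in K]        # transformed distances

# (z, Kz) >= 0 for arbitrary z; (z, D_tilde z) <= 0 for z summing to zero.
worst_kernel = min(quad_form(K, [rng.gauss(0.0, 1.0) for _ in range(8)])
                   for _ in range(200))
worst_distance = max(quad_form(D_tilde, zero_sum_vector(8, rng))
                     for _ in range(200))
print(worst_kernel >= -1e-9, worst_distance <= 1e-9)
```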

More generally, any mixture of $\tilde D(\lambda)$ over $\lambda > 0$ is a squared Euclidean
distance, yielding the following definition and theorem:

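Definition 2 (Schoenberg transformations) A Schoenberg transformation
is a function $\varphi(D)$ from $\mathbb{R}_+$ to $\mathbb{R}_+$ of the form (Schoenberg 1938a)

$$\varphi(D) = \int_0^\infty \frac{1 - \exp(-\lambda D)}{\lambda}\; g(\lambda)\, d\lambda \tag{9}$$

where $g(\lambda)d\lambda$ is a non-negative measure on $[0, \infty)$ such that
$\int_1^\infty \frac{g(\lambda)}{\lambda}\, d\lambda < \infty$.
Note that (9) entails $\varphi(D) \ge 0$ and $\varphi(0) = 0$ together with

$$\varphi'(D) = \int_0^\infty \exp(-\lambda D)\; g(\lambda)\, d\lambda \tag{10}$$

where $\varphi'(D)$ denotes the derivative of $\varphi(D)$.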

Theorem 5 (Fundamental property of Schoenberg transformations)

Let $D$ be a $n \times n$ matrix of squared Euclidean distances. Define the
components of the $n \times n$ matrix $\tilde D$ as $\tilde D_{ij} = \varphi(D_{ij})$, where
$\varphi(D)$ is a Schoenberg transformation. Then $\tilde D$ is a squared Euclidean
distance.

It follows from above that all componentwise transformations of the form
$\tilde D_{ij} = \varphi(D_{ij})$ transform a squared Euclidean distance into another squared
Euclidean distance. In his paper (1938a), Schoenberg indeed proved
(Theorem 6 p. 828) that all such transformations are given by Definition 2.
More precisely, Schoenberg addressed and solved the question of determining
the class $\Phi_m$ of all the transformations $\tilde D = \varphi(D)$ of squared Euclidean
distances $D$, associated to any configuration in $\mathbb{R}^m$, which are isometrically
embeddable in a Euclidean space of sufficiently large dimensionality, that
is in a Hilbert space. By construction, $\Phi_1 \supset \Phi_2 \supset \cdots \supset \Phi_\infty$ and
Definition 2 characterizes the class $\Phi_\infty = \cap_{p \ge 1} \Phi_p$. The class $\Phi_1$ is
central to Brownian and fractional Brownian motion (see e.g. Alpay et al. 2009),
while lower-order classes $\Phi_p$ with $p \le 3$ are fundamental in Geostatistics (see
e.g. Christakos 1984) and spatial interpolation (see e.g. Micchelli 1986; Stein
1999).
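
Theorem 5 can be checked on a small example by inspecting the smallest eigenvalue of the matrix of scalar products of Theorem 1, which should vanish up to rounding for a valid transformation. The following sketch assumes NumPy and a uniform centering distribution; the function name and the test configuration are hypothetical.

```python
import numpy as np

def min_eigenvalue_scalar_products(D):
    """Smallest eigenvalue of B = -1/2 H D H, with H the uniform centering matrix."""
    n = D.shape[0]
    H = np.eye(n) - np.full((n, n), 1.0 / n)
    return float(np.linalg.eigvalsh(-0.5 * H @ D @ H).min())

rng = np.random.default_rng(0)
P = rng.normal(size=(12, 2))
D = ((P[:, None, :] - P[None, :, :]) ** 2).sum(axis=-1)

transformations = {
    "power, exponent 0.4": lambda D: D ** 0.4,
    "logarithmic": lambda D: np.log1p(D),
    "gaussian": lambda D: 1.0 - np.exp(-D),
    "square (not a Schoenberg transformation)": lambda D: D ** 2,
}
for name, phi in transformations.items():
    # B always has one null eigenvalue, so valid transformations print
    # a value close to zero rather than a strictly positive one.
    print(name, min_eigenvalue_scalar_products(phi(D)))
```

Squaring the squared distances of three aligned points at positions 0, 1 and 2 yields distances 1, 1 and 4, which violate the triangle inequality; the corresponding smallest eigenvalue is then strictly negative.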

3 Some properties of the Schoenberg transformations

3.1 Complete monotonicity

By construction, the functions $\varphi'(D)$ in (10) coincide with the class of
completely monotonic functions $f(D)$ obeying $(-1)^n f^{(n)}(D) \ge 0$ (Bernstein
1929). Hence Schoenberg transformations are characterized by $\varphi(D) \ge 0$
with $\varphi(0) = 0$, with positive odd derivatives $\varphi'(D)$, $\varphi'''(D)$, etc., and
negative even derivatives $\varphi''(D)$, $\varphi''''(D)$, etc. (see Table 1).



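function $g(\lambda)$                                   transformation $\varphi(D)$                           bounded   rectifiable

$g_1(\lambda) = \delta(\lambda - a)$, $a > 0$                   $\varphi_1(D) = (1 - \exp(-aD))/a$                    yes       yes
$g_2(\lambda) = \theta(\lambda \le \pi/2)\, \lambda \sin\lambda$          $\varphi_2(D) = D(D + \exp(-\frac{\pi}{2}D))/(1 + D^2)$     yes       yes
$g_3(\lambda) = \exp(-a\lambda)$, $a > 0$                   $\varphi_3(D) = \ln(1 + D/a)$                         no        yes
$g_4(\lambda) = \lambda \exp(-a\lambda)$, $a > 0$               $\varphi_4(D) = D/(a(a + D))$                         yes       yes
$g_5(\lambda) = \frac{a}{\Gamma(1-a)} \lambda^{-a}$, $0 < a < 1$       $\varphi_5(D) = D^a$                                  no        no
see Berg et al. (2008)                          $\varphi_6(D) = D^a/(1 + D^a)$, $0 < a < 1$            yes       no

Table 1: some Schoenberg transformations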

In particular, $\sqrt{D}$ is squared Euclidean whenever $D$ is squared Euclidean.
Also, the identity transformation $\varphi(D) = D$ obtains from $g(\lambda) = \delta(\lambda)$.
The latter contribution can be made explicit in the following variant,
equivalent to Definition 2 (see e.g. Berg et al. 2008):

$$\varphi(D) = bD + \int_0^\infty (1 - \exp(-\lambda D))\; d\mu(\lambda)$$

where $\mu$ is a non-negative measure on $(0, \infty)$ such that
$\int_0^\infty \frac{\lambda}{1 + \lambda}\, d\mu(\lambda) < \infty$ and $b \ge 0$.

There exists a substantial literature about Bernstein functions (see e.g.
Berg et al. 2008; Schilling et al. 2010; and references therein), defined as
the smooth non-negative functions whose first derivatives are completely
monotonic. Hence, Schoenberg transformations coincide with the class of
Bernstein functions which are zero at the origin, in the same way that
Euclidean distances are c.n.d. matrices with zero diagonal (Theorem 2).

By construction, Schoenberg transformations are closed under composition,
as exemplified by $\varphi_6 = \varphi_4 \circ \varphi_5$ in Table 1.
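
The correspondence between a density $g(\lambda)$ and its transformation can be verified by numerical quadrature. The sketch below assumes the integral representation $\varphi(D) = \int_0^\infty (1 - \exp(-\lambda D))\, g(\lambda)/\lambda \; d\lambda$, whose derivative is (10), and checks the pair $g(\lambda) = \exp(-a\lambda)$, $\varphi(D) = \ln(1 + D/a)$. The truncation bound `upper`, the number of steps and the value `a = 1.5` are arbitrary illustrative choices.

```python
import math

def schoenberg_transform(g, D, upper=60.0, steps=200_000):
    """Midpoint-rule approximation of the integral over (0, upper) of
    (1 - exp(-lam * D)) / lam * g(lam)."""
    h = upper / steps
    total = 0.0
    for k in range(steps):
        lam = (k + 0.5) * h
        # -expm1(-x) evaluates 1 - exp(-x) without cancellation for small x
        total += -math.expm1(-lam * D) / lam * g(lam)
    return total * h

a = 1.5

def g_exponential(lam):
    """Density g(lam) = exp(-a * lam)."""
    return math.exp(-a * lam)

for D in (0.5, 2.0, 10.0):
    numeric = schoenberg_transform(g_exponential, D)
    closed_form = math.log(1.0 + D / a)
    print(D, numeric, closed_form)
```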

3.2 Arc length; rectifiable and bounded transformations

A Schoenberg transformation acts as an anamorphosis between Euclidean
spaces: to any initial configuration of points $X$, with mutual squared
Euclidean distances $D(X)$, corresponds a transformed configuration $\tilde X$
reconstructible by MDS from $\tilde D = \varphi(D)$. By construction, the mapping
$\tilde X(X)$ is unique up to a translation and a rotation.

Consider a smooth curve $\mathcal{C}$ whose arc length is parameterized by $s$,
containing two close points at mutual distance $\Delta s$. The corresponding distance
on the transformed curve $\tilde{\mathcal{C}}$ is $\Delta\tilde s = \sqrt{\varphi((\Delta s)^2)}$. By l'Hospital's
rule, the ratio of the infinitesimal arc lengths is

$$\frac{d\tilde s}{ds} = \lim_{\Delta s \to 0} \frac{\Delta \tilde s}{\Delta s} = \sqrt{\varphi'(0)}$$

which might be finite or not. On the other hand, infinitely distant points
in the original space might be infinitely distant or not in the transformed
space:

Definition 3 The transformation $\varphi(D)$ is said to be

• rectifiable if $\varphi'(0) < \infty$, that is iff $\int_0^\infty g(\lambda)\, d\lambda < \infty$

• bounded if $\varphi(\infty) < \infty$, that is iff $\int_0^\infty \frac{g(\lambda)}{\lambda}\, d\lambda < \infty$.

3.3 Right angles 

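Consider a triangle $ijk$ with a right angle in $k$. Hence $D_{ij} = D_{ik} + D_{jk}$ by
Pythagoras' theorem. Yet, in the transformed space, $\tilde D_{ij} \le \tilde D_{ik} + \tilde D_{jk}$ since
$\varphi(D_1 + D_2) \le \varphi(D_1) + \varphi(D_2)$, which can be demonstrated by integrating
$(1 - \exp(-\lambda D_1))(1 - \exp(-\lambda D_2)) \ge 0$ as in (9). That is, the Schoenberg
transformation $\tilde\alpha$ of a right angle $\alpha = \pi/2$ is in general acute. By the cosine
theorem

$$\cos\tilde\alpha = \frac{\varphi(D_1) + \varphi(D_2) - \varphi(D_1 + D_2)}{2 \sqrt{\varphi(D_1)\, \varphi(D_2)}}. \tag{11}$$

Under uniform linear dilatation of the original right-angled triangle by a
factor $\epsilon > 0$, (11) readily yields that $\lim_{\epsilon \to \infty} \tilde\alpha(\epsilon) = \pi/3$ whenever
$\varphi$ is bounded, and $\lim_{\epsilon \to 0} \tilde\alpha(\epsilon) = \pi/2$ whenever $\varphi$ is rectifiable.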
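
The two limits can be illustrated with a few lines of standard-library Python. The sketch evaluates the cosine theorem in the transformed space for a right triangle with squared legs $D_1$ and $D_2$; the function name `transformed_angle`, the leg values and the dilatation factors are illustrative assumptions.

```python
import math

def transformed_angle(phi, D1, D2):
    """Angle at k after transformation of a right triangle with squared legs D1, D2."""
    numerator = phi(D1) + phi(D2) - phi(D1 + D2)
    cos_alpha = numerator / (2.0 * math.sqrt(phi(D1) * phi(D2)))
    return math.acos(max(-1.0, min(1.0, cos_alpha)))  # clamp against rounding

def gaussian(D):
    """Bounded and rectifiable transformation 1 - exp(-D)."""
    return -math.expm1(-D)

def square_root(D):
    """Unbounded, non-rectifiable transformation sqrt(D)."""
    return math.sqrt(D)

D1, D2 = 1.0, 2.0
for eps in (1e-3, 1.0, 1e3):
    # a linear dilatation by eps multiplies squared distances by eps**2
    s = eps ** 2
    print(eps,
          transformed_angle(gaussian, s * D1, s * D2),
          transformed_angle(square_root, s * D1, s * D2))
```

For the power transformation the angle does not depend on the dilatation factor, since numerator and denominator scale identically.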
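
3.4 Curvature

Straight lines are bent by Schoenberg transformations: think of a rod whose
linear distances $d$ between constituents are contracted as, say, $\sqrt{d}$. The
curvature in the transformed space can be measured as follows: consider in
the original space three aligned points $i, k, j$ with $d_{ik} = d_{kj} = \epsilon$ and
$d_{ij} = 2\epsilon$. The Menger curvature $\kappa$ is defined as the limit (Blumenthal 1953
p. 75)

$$\kappa = \lim_{\epsilon \to 0} \frac{4\, \tilde A_{ijk}}{\tilde d_{ij}(\epsilon)\, \tilde d_{jk}(\epsilon)\, \tilde d_{ik}(\epsilon)}$$

where $\tilde A_{ijk}$ is the area of the triangle $ijk$ in the transformed space and
$\tilde d$ denotes the length of the corresponding sides. Heron's formula

$$16 \tilde A_{ijk}^2 = (\tilde d_{ij} + \tilde d_{jk} + \tilde d_{ki})(-\tilde d_{ij} + \tilde d_{jk} + \tilde d_{ki})(\tilde d_{ij} - \tilde d_{jk} + \tilde d_{ki})(\tilde d_{ij} + \tilde d_{jk} - \tilde d_{ki})$$

yields after simplification, with $\tilde d_{ik}^2 = \tilde d_{kj}^2 = \varphi(\epsilon^2)$ and $\tilde d_{ij}^2 = \varphi(4\epsilon^2)$,

$$\kappa^2 = \lim_{\epsilon \to 0} \frac{4\varphi(\epsilon^2) - \varphi(4\epsilon^2)}{\varphi^2(\epsilon^2)} = -\frac{6\, \varphi''(0)}{(\varphi'(0))^2}$$

where l'Hospital's rule has been used twice in the last equality, under the
assumption of rectifiability.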
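
The limit can be approached numerically by applying Heron's formula to three aligned points at a small spacing. The sketch below uses only the standard library and compares the result with $\sqrt{-6\varphi''(0)}/\varphi'(0)$, the value obtained by expanding $\varphi$ to second order around the origin; the function names, the spacing `eps` and the two test transformations are illustrative assumptions.

```python
import math

def menger_curvature(a, b, c):
    """4 * area / (a * b * c) for a triangle with sides a, b, c (Heron's formula)."""
    sixteen_area_sq = (a + b + c) * (-a + b + c) * (a - b + c) * (a + b - c)
    return math.sqrt(max(sixteen_area_sq, 0.0)) / (a * b * c)

def transformed_curvature(phi, eps):
    """Menger curvature of three aligned points at spacing eps, after transformation."""
    d_ik = d_kj = math.sqrt(phi(eps ** 2))
    d_ij = math.sqrt(phi((2.0 * eps) ** 2))
    return menger_curvature(d_ij, d_kj, d_ik)

def logarithmic(D):
    """ln(1 + D): phi'(0) = 1 and phi''(0) = -1."""
    return math.log1p(D)

def gaussian(D, a=2.0):
    """(1 - exp(-a D)) / a: phi'(0) = 1 and phi''(0) = -a."""
    return -math.expm1(-a * D) / a

eps = 1e-2
print(transformed_curvature(logarithmic, eps), math.sqrt(6.0))
print(transformed_curvature(gaussian, eps), math.sqrt(12.0))
```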



4 Illustrations 

4.1 Grid 

Consider $n = 100$ points forming the bidimensional grid of Figure 1a), on
which the transformation $\varphi(D) = D^{0.4}$ is applied. Figures 1b) and 1c)
depict the first four dimensions of the transformed configuration, expressing
altogether 62% of the total inertia.

4.2 Rod 

Figure 2 depicts the low-order projections (b, c, d, e and f) of the
non-rectifiable square root transformation $\tilde D = \sqrt{D}$ of a
quasi-unidimensional rod of $n = 1000$ points, uniformly generated as
$X_1 \sim U(0, 1000)$ and $X_2 \sim U(0, 1)$ (a). As expected, the transformed rod is
bent, although the curvature formula of Section 3.4 does not apply here
($\varphi'(0) = \infty$).

The transformation of a line is called "screw line" by Von Neumann
and Schoenberg (1941), and "helix" by Kolmogorov (1940) - an adequate
terminology in view of Figure 2.

[Figure 1, panels a) to d). Axis labels: original space, dim = 1 (relative
inertia = 80%); transformed space, dim = 1 (rel. inertia = 36%); transformed
space, dim = 3 (rel. inertia = 9%).]

Figure 1: a) Initial configuration, on which the transformation
$\varphi(D) = D^{0.4}$ is applied. b) and c) depict the low-dimensional
reconstruction of the transformed configuration, obtained by weighted MDS
(Theorem 3) where $a = f$ is the uniform distribution. d) Scree graph,
proportional to the eigenvalues (8).

Figure 2: Low-order projections (b, c, d, e and f) of the square root
transformation $\tilde D = \sqrt{D}$ of a finite rod (a).

[Figure 3. Axis labels: original space, dimension = 1 (56%); transformed
space, dimension = 1 (11%); transformed space, dimension = 3 (7%).]

Figure 3: Left: three groups of 50 individuals each, uniformly generated
on concentric circles of radii 1, 3 and 5, with a radial standard deviation
of 0.1, 0.3 and 0.2, respectively. Center and right: MDS reconstruction of
the configuration transformed as $\varphi(D) = 1 - \exp(-0.65\, D)$ (see text), in
dimensions 1 and 2 (center) and dimensions 3 and 4 (right).



For the transformed rod, the first two MDS dimensions turn out to express
61.0% and 15.1% of the relative inertia, respectively. Analytic arguments, to
be developed in a forthcoming publication, demonstrate the corresponding
exact quantities to be 60.8% and 15.2%, respectively, for a line.

5 Application: distance-based discriminant analysis

Consider a collection of objects $i = 1, \dots, n$ endowed with $p$-dimensional
features, yielding squared Euclidean distances $D_{ij}$ between objects, possibly
after standardization and/or orthogonalization of the features (Mahalanobis
distances). Also, suppose that each object belongs to a group $g = 1, \dots, m$.
An elementary discriminant strategy would consist in assigning each object
$i$ to the group $g$ whose centroid is the closest to $i$, that is to assign $i$ to
$\arg\min_g D_{ig}$: this is the linear discriminant prescription of Fisher (1936),
successfully applied on the Iris Data ($n = 150$, $p = 4$, $m = 3$) with a
percentage of well-classified individuals as high as 97%.

The same strategy is bound to fail with the data of Figure 3 ($n = 150$,
$p = 2$, $m = 3$), reaching a percentage of well-classified individuals of 35%,
close to the expected value of 33% under random attribution.

However, linear discrimination can be attempted on Schoenberg
transformations of the original distances, resulting in the algorithm (see (5)):

[Figure 4, panels a) to c). Horizontal axes: parameter a (power
transformation); parameter a (logarithmic transformation); parameter a
(Gaussian transformation).]

Figure 4: Proportion of well-classified individuals, after Schoenberg
transformation of the original data of Figure 3. a) power transformation
$\varphi(D) = D^a$; note that $a > 1$ does not correspond to a valid transformation,
and results in a decrease of the proportion below the chance level.
b) logarithmic transformation $\varphi(D) = \ln(1 + aD)$. c) Gaussian
transformation $\varphi(D) = 1 - \exp(-aD)$.



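Distance-based discriminant algorithm:

1) compute $\tilde D_{ig} = \sum_{j=1}^{n} f_j^g \tilde D_{ij} - \frac{1}{2} \sum_{j,k=1}^{n} f_j^g f_k^g \tilde D_{jk}$,
where $\tilde D_{ij} = \varphi(D_{ij})$
and $f_j^g = I(j \in g)/n_g$ (with $n_g = \sum_j I(j \in g)$) is the distribution in group $g$

2) assign object $i$ to group $\arg\min_g \tilde D_{ig}$.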

Figure 4 shows the resulting proportion of well-classified individuals,
for various one-parameter families of transformations $\varphi(D|a)$. In this data
set, the maximum proportion of well-classified individuals reaches 100% for
the Gaussian transformation (for $a \ge 0.65$). That is, a sufficiently vigorous
Schoenberg transformation succeeds in mapping the initial configuration
of Figure 3 in such a way that the three groups can be enclosed in three
associated disjoint hyperspheres.
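
The two steps of the algorithm can be sketched in standard-library Python as follows. The function names, the random seed and the synthetic ring data (generated along the lines of the description of Figure 3) are assumptions made for illustration; the proportions printed depend on the random draw and need not reproduce the figures reported above.

```python
import math
import random

def squared_distances(points):
    """Matrix of squared Euclidean distances between points given as tuples."""
    return [[sum((u - v) ** 2 for u, v in zip(p, q)) for q in points] for p in points]

def discriminate(D, labels, phi):
    """Assign each object to the group with the closest centroid, with all
    squared distances transformed as phi(D_ij) (steps 1 and 2)."""
    n = len(D)
    Dt = [[phi(d) for d in row] for row in D]
    groups = sorted(set(labels))
    members = {g: [j for j in range(n) if labels[j] == g] for g in groups}
    # within-group dispersion: 1/2 sum_jk f_j f_k Dt_jk, with f uniform on the group
    dispersion = {g: 0.5 * sum(Dt[j][k] for j in idx for k in idx) / len(idx) ** 2
                  for g, idx in members.items()}

    def to_centroid(i, g):
        idx = members[g]
        return sum(Dt[i][j] for j in idx) / len(idx) - dispersion[g]

    return [min(groups, key=lambda g: to_centroid(i, g)) for i in range(n)]

def ring_data(seed=0):
    """Three groups of 50 points on noisy concentric circles (hypothetical draw)."""
    rng = random.Random(seed)
    points, labels = [], []
    for g, (radius, sd) in enumerate([(1.0, 0.1), (3.0, 0.3), (5.0, 0.2)]):
        for _ in range(50):
            r, t = rng.gauss(radius, sd), rng.uniform(0.0, 2.0 * math.pi)
            points.append((r * math.cos(t), r * math.sin(t)))
            labels.append(g)
    return points, labels

points, labels = ring_data()
D = squared_distances(points)
for name, phi in [("identity", lambda d: d),
                  ("gaussian, a = 0.65", lambda d: 1.0 - math.exp(-0.65 * d))]:
    assigned = discriminate(D, labels, phi)
    accuracy = sum(x == y for x, y in zip(assigned, labels)) / len(labels)
    print(name, accuracy)
```

As in the algorithm above, each object contributes to the centroid of its own group; a leave-one-out variant would be a different design choice.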

On one hand, this result is completely expected: mapping the data into
a high-dimensional feature space, in which the former become linearly
separable, is a routine strategy in the Machine Learning community, developed
ever since the nineties (see e.g. Chen et al. 2007 and references therein).
On the other hand, the conceptual, formal and computational simplicity of
the above, presumably new algorithm, should be emphasized.

6 Conclusion



The Machine Learning literature contains innumerable algorithms based
upon Gaussian and other radial kernels: the procedure exposed in Section 5
is indeed just one among many possible applications, aimed at illustrating
the operational content of the theory. Higher-order "principled" embeddings,
pioneered by the work of Vapnik (1995) and embodied in this article
by the class of Schoenberg transformations, are arguably about to be
incorporated in standard Data Analysis, to be routinely used in applications,
and taught to graduate and undergraduate non-specialized audiences.

Recasting the whole Machine Learning formalism in terms of Euclidean
distances, rather than in terms of kernels, could efficiently contribute
towards this assimilation: first, the statements in either formalism can be
translated into the other, as granted by the theorems of Section 2. In
particular, to the "kernel trick" stating that all the quantities of interest
depend upon kernels only (and not upon the object features themselves)
corresponds an equally efficient "distance trick", stating that Euclidean
distances themselves (and not their underlying coordinates) suffice to express
all the real quantities of interest, as in (5), (6), or Section 5; see also
Schölkopf (2000) and Williams (2002). Furthermore, Euclidean distances are
arguably more intuitive than kernels, as attested by the development of
Geometry and Data Analysis (including their non-Euclidean extensions; see e.g.
Critchley and Fichet (1994) for a review). In that respect, such a
reformulation could prove beneficial, both from the perspective of future
scientific developments and from a pedagogical point of view.